Second Lab

Today’s goals. Learn about:

  1. Monte Carlo simulations
  2. Point estimates
    • Population mean estimator.
    • Population variance estimator.

Workflow

  1. Consider examples related to theoretical concepts
  2. Use R to solve some basic questions and to understand important aspects

Let’s start

  1. Download the R code (it is briefly commented)
  2. You will find XXX where I will ask you to code a bit
  3. Part of the code after XXX depends on successfully filling in these lines of code

Monte Carlo simulation

Monte Carlo simulation

Recall the concept of simulation

Simulation: Approximate a process and derive general results by acting as if we observed the process many times.

Monte Carlo simulation

  • We rely on so-called Monte Carlo simulation, a wide class of simulation procedures based on sampling independent and identically distributed values from a process (precisely, from the presumably true underlying probability distribution of the process) and computing some numerical outputs.

  • The steps of a Monte Carlo simulation are

    • Generate \(n\) independent and identically distributed values from a process.
    • Compute a summary for this sample, a statistic.
    • Repeat the steps above \(R\) times and obtain a sample distribution for the statistic.
  • In what follows, we will consider the behavior of some sample statistics, such as the mean and the variance, through their sample distribution.
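The three steps above can be sketched in a few lines of R. This is a self-contained illustration (the Exp(1) distribution and the values of \(R\) and \(n\) are chosen here for the example, not part of the lab code):

```r
set.seed(1)
R <- 1000   # number of Monte Carlo replications
n <- 30     # sample size
stat <- numeric(R)
for (r in 1:R) {
  y <- rexp(n, rate = 1)   # step 1: n iid values from a process (here Exp(1))
  stat[r] <- mean(y)       # step 2: a summary statistic for the sample
}                          # step 3: repeated R times
# 'stat' now approximates the sampling distribution of the mean:
# its average is close to 1 and its sd close to 1/sqrt(n)
```

The vector `stat` is exactly the "sample distribution for the statistic" mentioned in step 3.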

Distribution of the sample mean

Three scenarios

Suppose that \(Y_1, \ldots, Y_n\) is a sequence of iid random variables from

  • Case 1: \(\mathcal{N}(0,1)\) (i.e. standard normal);

  • Case 2: \({\rm t}_{3}\) (i.e. Student's \(t\) with 3 degrees of freedom);

  • Case 3: \(\mbox{U}(0,1)\) (i.e. standard continuous uniform).

Goal

Determine the distribution of the sample mean \(\overline{Y}=\frac{1}{n}\sum_{i=1}^{n}Y_{i}\)

We can answer using theoretical results, but let’s investigate via simulation. Consider:

  • Number of simulations \(R=1000\);

  • Sample size \(n=30\).

Distribution of the sample mean

set.seed(1234)
R <- 1000
n <- 30

#generate the sample of size n (for the 3 distributions) R times 
samples <- array(0, c(3, n, R))
for(i in 1 : R){
  samples[1, ,i] <- rnorm(n, 0, 1)
  samples[2, ,i] <- rt(n, df = 3)
  samples[3, ,i] <- runif(n, 0, 1)
}

Note

Above we saved the results in a 3-dimensional array. Alternatively, you could create 3 different matrices (each of size \(n\times R\)) or, depending on the goal of the analysis, compute some quantities of interest directly without saving the generated samples.

Distribution of the sample mean

Sample mean computation

We can compute the sample mean via two alternative but equivalent methods

  • The apply() function.
  • A for loop.
sample_stat <- apply(samples, MARGIN = c(1, 3), FUN = mean)

# Equivalently (using a for loop)
sample_stat2 <- matrix(0, nrow = 3, ncol = R)
for(i in 1 : R) {
  sample_stat2[, i] <- apply(samples[, , i], 1, mean)
}
max(abs(sample_stat - sample_stat2)) # Check
[1] 0
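To see what apply() with MARGIN = c(1, 3) does, consider a toy array (the values here are chosen purely for illustration): the mean is taken over the second dimension, leaving one value per (distribution, replication) pair.

```r
# 2 "distributions", samples of size 3, 2 replications
a <- array(1:12, dim = c(2, 3, 2))
apply(a, MARGIN = c(1, 3), FUN = mean)
# a 2 x 2 matrix: rows index the first dimension, columns the third
```

Here `a[1, , 1]` contains 1, 3, 5, so the entry in row 1, column 1 of the result is 3, and so on.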

Distribution of the sample mean: Histogram plot

par(mfrow = c(1, 3))
hist(sample_stat[1,], nclass = 30, probability = TRUE, 
     xlab="y", main= "N(0,1)", cex.main = 2.5, cex.lab = 2)
hist(sample_stat[2,], nclass = 30, probability = TRUE, 
     xlab = "y", main = expression(t[3]), cex.main = 2.5, cex.lab = 2)
hist(sample_stat[3,], nclass = 30, probability = TRUE, 
     xlab = "y", main = "U(0,1)", cex.main = 2.5, cex.lab = 2)

Distribution of the sample mean

Sample mean distribution

  • In the Gaussian case the sample mean \(\overline Y\) for iid values still follows a normal distribution, precisely \(\bar Y_n \sim \mathcal{N}(\mu, {\sigma^{2}}/{n})\).

  • For other distributions, this result holds only asymptotically, as \(n \rightarrow \infty\), by the Central Limit Theorem (CLT). When the sample does not come from a normal distribution, \(\overline{Y} \overset{\cdot}{\sim} \mathcal{N}(\mu, {\sigma^{2}}/{n})\).
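This can be checked numerically. As a self-contained sketch, re-creating only the uniform case (the values of \(R\) and the seed are chosen here), the simulated means should match the theoretical mean \(1/2\) and standard deviation \(\sqrt{1/(12n)}\):

```r
set.seed(42)
R <- 2000
n <- 30
# sampling distribution of the mean of n iid U(0,1) values
means_u <- replicate(R, mean(runif(n)))
mean(means_u)   # close to the theoretical mean 1/2
sd(means_u)     # close to the theoretical sd sqrt(1/(12 * n))
```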

Time to work with R: It’s your turn

Following the previous comment, we want to overlay the appropriate Gaussian density on the histograms.

  • This requires obtaining the parameters of the appropriate Gaussian distribution. Thus, for each distribution we must obtain \(E[\bar Y_n]\) and \(Var(\bar Y_n)\)

Distribution of the sample mean: Overlapping the Gaussian density

par (mfrow=c(1,3))

hist(sample_stat[1,], nclass = 30, probability = TRUE, 
     xlab="y", main= "N(0,1)", cex.main = 1.5)
curve(dnorm(x, 0, sqrt(1/n)), add = TRUE, col = "red", lwd = 2)

hist(sample_stat[2,], nclass = 30, probability = TRUE, 
     xlab = "y", main = expression(t[3]), cex.main = 1.5)
curve(dnorm(x, 0, sqrt((3/((3 - 2) * n)))), add = TRUE, col = "red", lwd = 2)

hist(sample_stat[3,], nclass = 30, probability = TRUE, 
     xlab = "y", main = "U(0,1)", cex.main = 1.5)
curve(dnorm(x, 1/2, sqrt(1/(12 * n))), add = TRUE, col = "red", lwd = 2)

Other graphical tools: ECDF

# Empirical Cumulative Distribution Function (ECDF)
par(mfrow = c(1,3))
plot(ecdf(sample_stat[1,]), xlab="y", main= "N(0,1)", cex.main = 1.5)
curve(pnorm(x, 0, sqrt(1/n)), add = TRUE, col = "red", lty = 2, lwd = 2)
plot(ecdf(sample_stat[2,]), xlab="y", main= expression(t[3]), cex.main = 1.5)
curve(pnorm(x, 0, sqrt((3/((3 - 2) * n)))), add = TRUE, col = "red", 
      lty = 2, lwd = 2)
plot(ecdf(sample_stat[3,]), xlab="y", main= "U(0,1)", cex.main = 1.5)
curve(pnorm(x, 1/2, sqrt(1/(12 * n))), add = TRUE, col = "red", 
      lty = 2, lwd = 2)

Other graphical tools: Normal Q-Q plot

# Normal Q-Q plot
par(mfrow = c(1, 3))
qqnorm(sample_stat[1,])
abline(0, sqrt(1/n))
qqnorm(sample_stat[2,])
abline(0, sqrt((3/((3 - 2) * n))))
qqnorm(sample_stat[3,])
abline(1/2, sqrt(1/(12 * n)))

Distribution of the sample variance under the Gaussian case

Sample variance under the Gaussian case

We know that \(S^{2}= \frac{1}{n-1} \sum_{i=1}^{n}(Y_{i}-\bar{Y})^{2}\) is an unbiased estimator for the variance (an estimator is said to be unbiased iff \(E(\widehat{{\theta}})= {\theta}\)).

The output provided by the function var() applied to a sample vector is the sample quantity \[s^2 = \frac{1}{n-1} \sum_{i=1}^{n}(y_i - \bar y)^2\]

Leveraging the example above, if we assume that the true generating model is normal (case 1), we know that the distribution of the sample variance is proportional to a \(\chi^{2}\) distribution with \(n-1\) degrees of freedom:

\[ \frac{(n-1)S^2}{\sigma^2} \sim \chi^{2}_{n-1}\]
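A self-contained check of this result (the seed and sizes are chosen here): the simulated quantity \((n-1)S^2/\sigma^2\) should match the first two moments of a \(\chi^2_{n-1}\), namely mean \(n-1\) and variance \(2(n-1)\).

```r
set.seed(7)
R <- 2000
n <- 10
sigma <- 1
# simulate (n-1) * S^2 / sigma^2 for R normal samples
q <- replicate(R, (n - 1) * var(rnorm(n, 0, sigma)) / sigma^2)
mean(q)   # close to n - 1 = 9
var(q)    # close to 2 * (n - 1) = 18
```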

Distribution of the sample variance under the Gaussian case

Time to work with R: It’s your turn

Obtain the distribution of the sample variance via simulation, plot it via histograms/ECDF, and overlay the corresponding density/cumulative distribution function (using the formula above).

Remember the following transformation

Let \(X\) be a r.v. with pdf \(f_X(x)\); then \(Y=g(X)\), with \(g(\cdot)\) an invertible function, has pdf \[f_Y(y)=f_X(g^{-1}(y))\Bigg | \frac{d}{dy} g^{-1}{(y)}\Bigg | \] So, in our case, let \(X=\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\); then \(S^2=Y=\frac{\sigma^2}{n-1} X\) is such that \[f_Y(y)= f_X \Big(\frac{n-1}{\sigma^2}y \Big) \frac{n-1}{\sigma^2},\]

since \(g^{-1}(y) = \frac{n-1}{\sigma^2}y\) and \(\frac{d}{dy} g^{-1}{(y)}=\frac{n-1}{\sigma^2}\)

Distribution of the sample variance under the Gaussian case

par (mfrow = c(1, 2))
sigma <- 1 

# This slightly differs from what was implemented during the lab,
# where I considered apply(samples, c(1,3), var)[1,]
sample_var <- apply(samples, c(1,3), var) 

# Histogram 
hist(sample_var[1,], nclass = 30, probability = TRUE, 
     xlab = expression(s^2), main = "Case 1: N(0,1)", cex.main = 1.5)
curve((n-1)/sigma^2 * dchisq(x * (n-1)/sigma^2,  df = n - 1), 
      add = TRUE, col = "red", lwd = 2)

# ECDF vs CDF 
plot(ecdf(sample_var[1,]), xlab = expression(s^2), main = "Case 1: N(0,1)", 
     cex.main = 1.5)
curve(pchisq(x * (n-1)/sigma^2,  df = n - 1),  add = TRUE, col = "red", lwd = 2)

Basic concepts of estimation

Point estimation

Problem of statistical inference

Let \(\mathcal{F} = \{p(y;\theta), \theta \in \Theta\}\) be a parametric statistical model, with \(\Theta\) the parameter space and \(p(y; \theta)\) a collection of density (or probability) functions. Assume that \(\mathcal{F}\) is correctly specified, so that the true parameter value \(\theta^0 \in \Theta\). Some problems of statistical inference on the parameter \(\theta\) rely on the following questions:

  1. What can we say, having observed \(y\) and assumed \(\mathcal{F}\), about where \(\theta^0\) is located in \(\Theta\)? This is an estimation problem: if we want to obtain from \(y\) a specific value for \(\theta^0\) we are performing point estimation (previous lab), while if we want to target a subset of \(\Theta\) that is likely to include \(\theta^0\), then we are dealing with an interval estimation procedure (next lab).

  2. Do the data reasonably agree with the hypothesis that \(\theta^0\) belongs to a subset \(\Theta^0\) of \(\Theta\)? In such a case, we are dealing with a hypothesis testing problem (the lab after next).

Important

The true parameter value \(\theta^0\) is unknown, and inference on \(\theta\) (estimation, hypothesis testing) should be understood as inference on \(\theta^0\)

Point estimation

Unbiasedness and consistency of estimators

Let \(\hat \theta\) be the point estimator for the parameter \(\theta\). An estimator is said to be

  • unbiased if and only if \(E(\hat \theta)=\theta\);
  • consistent if \(\hat \theta \stackrel{P}\rightarrow \theta\) as \(n \rightarrow \infty\); for an unbiased estimator, a sufficient condition is \(\text{var}(\hat \theta) \rightarrow 0\) as \(n \rightarrow \infty\).

Bias–Variance tradeoff and Mean Squared Error

There are some properties we would like an estimator to have: low variance and low bias. Since there is a tradeoff between unbiasedness and low variance, we usually seek both to some extent; ideally, we target a small Mean Squared Error (MSE)

\[\mathrm{MSE}(\hat \theta)=E \left\{\left(\hat \theta-\theta\right)^2 \right\}= \left\{E(\hat \theta) - \theta\right\}^2+\text{var}(\hat \theta)=\left(\text{Bias}\right)^2+\text{Variance} \]
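The decomposition can be verified numerically. Here is a self-contained sketch using the sample mean under \(\mathcal{N}(2, 4)\) (the settings are chosen for illustration), whose true MSE is \(\sigma^2/n = 0.4\):

```r
set.seed(3)
R <- 5000
n <- 10
mu <- 2
est <- replicate(R, mean(rnorm(n, mu, 2)))   # sample mean as estimator of mu
mse_direct <- mean((est - mu)^2)             # E{(theta_hat - theta)^2}
mse_decomp <- (mean(est) - mu)^2 + var(est)  # Bias^2 + Variance
# the two agree up to a small R/(R-1) correction hidden in var()
```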

Application: Comparison of four estimators of the population mean

Four estimators of the population mean

As we already know from the theory, the sample mean is an excellent estimator for the population mean, but we will now consider other estimators as well. Consider the case of a normal distribution, that is, assume independent \(X_i \sim \mathcal{N}(\mu, \sigma^2)\), \(i = 1, \ldots, n\), and the following estimators (below, \(X_{(i)}\) denotes the \(i\)-th order statistic):

Estimator Formula
Sample mean \(\hat \mu_1=\frac{1}{n}\sum_{i=1}^{n}X_i\)
Sample median \(\hat \mu_2= \begin{cases} X_{((n+1)/2)} & n \text{ odd} \\ (X_{(n/2)} + X_{(n/2+1)})/2 & n \text{ even}\end{cases}\)
Mid-range \(\hat \mu_3=\frac{X_{(1)}+X_{(n)}}{2}\)
Trimmed mean (10%) \(\hat \mu_4=\frac{1}{0.8n}\sum_{i=0.1n+1}^{0.9n}X_{(i)}\)

Note

The idea of the trimmed sample mean is to compute the mean after discarding 10% of the data from the lower tail and 10% from the upper tail (the definition above covers only the case where \(0.1n\) is an integer)
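In R, mean(x, trim = 0.1) implements this directly, dropping the lowest 10% and the highest 10% of the ordered values. A toy example (the numbers are chosen here) shows its robustness to an outlier:

```r
x <- c(1:9, 100)      # one large outlier
mean(x)               # 14.5, pulled up by the outlier
mean(x, trim = 0.1)   # 5.5, the mean of the middle 8 ordered values
```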

Application: Comparison of four estimators of the population mean

Goal

Compare the four estimators above in terms of unbiasedness and efficiency.

We will use the Monte Carlo method to assess these properties.

Time to work with R: It’s your turn

  • Using simulation, obtain the sampling distributions of the four estimators of the mean—sample mean, sample median, midrange, and 10% trimmed mean—based on \(R\) replications of samples of size \(n\) from \(\mathcal{N}(\mu,\sigma^2)\).
  • Create a boxplot comparing the estimators and draw a reference line at \(\mu\).
  • Then compute and report, for each estimator, its bias, variance, and mean squared error (MSE).

Application: Comparison of four estimators of the population mean

R <- 1000   # Number of replications
n <- 10     # Sample size
mu <- 2     # Population mean
sigma <- 2  # Population standard deviation

est <- matrix(0, R, 4)
label_est <- c("Mean", "Median", "(Min + Max)/2", "10pTrimMean")

set.seed(1234)
for (i in 1 : R) {
 x <- rnorm(n, mu, sigma)
 est[i, 1] <- mean(x)
 est[i, 2] <- median(x)
 est[i, 3] <- (min(x) + max(x))/2
 est[i, 4] <- mean(x, trim = 0.1)
}

par(mfrow = c(1, 1), xaxt = "n")
boxplot(est, main="Comparison between four estimators")
par(xaxt = "s")
axis(1, 1 : 4, label_est)
abline(h = mu, lwd = 2, col = "blue")

Application: Comparison of four estimators of the population mean

# Bias
bias <- apply(est, 2, mean) - mu
bias
[1]  0.012231786  0.032274611 -0.005109588  0.016567130
# Variances
variance <- apply(est, 2, var)
variance
[1] 0.3674287 0.5111770 0.7202595 0.3870022
# MSE 
MSE <- variance + bias^2
MSE
[1] 0.3675783 0.5122186 0.7202856 0.3872767

Note

All the estimators appear unbiased, and the sample mean registers the lowest estimated variance, as well as the lowest estimated MSE

Application: Comparison of four estimators of the population mean

Goal

  • Let’s now check whether all the estimators are consistent.
  • For checking this, \(n=10\) is extremely low and needs to be increased, say to \(n=200, 1000\).
  • Compare the simulated values for \(n=10\) with those obtained for \(n=200\) and \(n=1000\) using histograms. What do you expect?

Application: Comparison of four estimators of the population mean

R <- 1000
n <- c(10, 200, 1000)
mu <- 2
sigma <- 2

est <- array(0, dim = c(R, 4, 3))
label_est <- c("Mean", "Median", "(Min + Max)/2", "10pTrimMean")

set.seed(13)


for(j in 1 : length(n)){
 for (i in 1 : R) {
  x <- rnorm(n[j], mu, sigma)
  est[i, 1, j] <- mean(x)
  est[i, 2, j] <- median(x)
  est[i, 3, j] <- (max(x) + min(x))/2
  est[i, 4, j] <- mean(x, trim = 0.1)
 }
}  

# Set up a 2 x 6 layout: each row shows two estimators
# across the three sample sizes
par(mfrow = c(2, 6))

for(k in 1:2){  # First row: estimators 1-2
 for (j in 1 : length(n)) {
  hist(est[,k, j], nclass = 30,  probability = T, xlab = "", xlim = c(-2, 6),
       main = paste(label_est[k], "- n =", n[j]))
  abline(v = mu, col = "blue", lwd = 2)
 }
}

for(k in 3:4){  # Second row: estimators 3-4
 for (j in 1 : length(n)) {
  hist(est[,k, j], nclass = 30,  probability = T, xlab = "", xlim = c(-2, 6),
       main = paste(label_est[k], "- n =", n[j]))
  abline(v = mu, col = "blue", lwd = 2)
 }
}

Application: Comparison of four estimators of the population mean

Note

Note that for the sample mean estimator we know the theoretical variance \(\sigma^2/n\). The values obtained with \(n=10, 200, 1000\) approximate it well.
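For reference, with the settings used here (\(\sigma = 2\)) the theoretical variances are:

```r
sigma <- 2
n <- c(10, 200, 1000)
sigma^2 / n   # 0.4, 0.02, 0.004: compare with the "Mean" column of the simulated variances
```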

Application: Comparison of four estimators of the population mean

# Variances
variance <- matrix(0, 3, 4)
for(j in 1:length(n)){
 variance[j,] <- apply(est[,,j], 2, var)
}
# Bias
bias <- matrix(0, 3, 4)
for(j in 1:length(n)){
 bias[j,] <- apply(est[,,j], 2, mean) - mu 
}

# MSE 
MSE <- variance + bias^2

colnames(bias) <- colnames(variance) <- colnames(MSE) <- label_est
rownames(bias) <- rownames(variance) <- 
  rownames(MSE) <- c("n = 10", "n = 200", "n = 1000" )

variance 
                Mean      Median (Min + Max)/2 10pTrimMean
n = 10   0.377966306 0.532483793     0.7084689 0.405919655
n = 200  0.020115434 0.032069233     0.3122608 0.021081454
n = 1000 0.003986159 0.005982136     0.2469938 0.004213873
bias
                  Mean        Median (Min + Max)/2   10pTrimMean
n = 10   -0.0143645259 -0.0058309337  -0.021308379 -0.0126285625
n = 200   0.0047097379  0.0105994276  -0.015492342  0.0074923632
n = 1000 -0.0004866974 -0.0005205328   0.009392445 -0.0001667324
MSE
                Mean      Median (Min + Max)/2 10pTrimMean
n = 10   0.378172645 0.532517793     0.7089229 0.406079136
n = 200  0.020137615 0.032181580     0.3125008 0.021137589
n = 1000 0.003986395 0.005982407     0.2470820 0.004213901

Comparison of unbiased and biased sample variance estimators

Note

Related to the distribution of the sample variance in the normal case, we could also rely on the biased estimator \(S^{2}_{b}= \frac{1}{n}\sum_{i=1}^{n}(Y_{i}-\bar{Y})^{2}\).

Check the biased nature of \(S^{2}_b\) via MC simulation, generating \(n = 10\) iid values from a normal distribution. Compare the results with the unbiased estimator \(S^{2}\)

#Initial settings
set.seed(2)
R <- 1000
n <- 10
mu <- 0
sigma <- 1

# Save the results in two vectors:
# s2: unbiased sample variance estimates 
# s2_b: biased sample variance estimates
s2 <- rep(0, R)
s2_b <- rep(0, R)

# For each replication we generate a sample of size n from
# a normal r.v. with mean mu and variance sigma^2
# and we compute the two sample variance estimates
for(i in 1 : R) {
  y <- rnorm(n, mu, sigma)
  s2[i] <- var(y) 
  s2_b[i] <- var(y) * (n - 1)/n
}


s2_mean <- mean(s2)
s2_b_mean <- mean(s2_b)
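As a sanity check (self-contained, with a seed chosen here), the biased estimator should center near \(\frac{n-1}{n}\sigma^2 = 0.9\) rather than \(\sigma^2 = 1\):

```r
set.seed(99)
R <- 1000
n <- 10
sigma <- 1
# biased sample variance: var() rescaled by (n-1)/n
s2b <- replicate(R, var(rnorm(n, 0, sigma)) * (n - 1) / n)
mean(s2b)   # close to 0.9, showing the downward bias of S^2_b
```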

Comparison of unbiased and biased sample variance estimators

#plot s2
par(mfrow = c(1, 2), oma = c(0, 0, 0, 0))
hist(s2, breaks = 50, xlab = expression(s^2), probability = TRUE,
     main = expression(s^2), cex.main = 1.5)
#in red the true mean, in blue the estimated mean
abline(v = sigma^2, col = "red", lwd = 2)
abline(v = s2_mean, col = "blue", lwd = 3, lty = 3) 

#plot s2 biased
hist(s2_b, breaks = 50, xlab = expression(s[b]^2), probability = TRUE,
     main = expression(s[b]^2), cex.main = 1.5)
#in red the true mean, in blue the estimated mean
abline(v = sigma^2, col = "red", lwd = 3)
abline(v = s2_b_mean, col = "blue", lwd = 3, lty = 2)